🔪Crime Analysis in the City of Los Angeles🌴#
Author: Dan Xu
Course Project, UC Irvine, Math 10, Summer 2023
👉Introduction#
In light of the recent surge in crime rates, there is growing concern about the safety of our society. To address this issue, I am undertaking a research project that leverages a dataset sourced from the Los Angeles Police Department (LAPD), encompassing crime data from 2020 to the present day. This dataset provides a valuable resource for investigating whether there are more female victims than male victims, and for predicting the total number of victims per month. Additionally, I try to use linear classification to determine the gender of the victim, and K-Means to see how clusters partition our data. My goal is to explore these assumptions comprehensively, shedding light on any potential correlations and contributing to our understanding of the dynamics behind criminal activity.
📚Definition and Description#
Below, you’ll find explanations for each column in the dataset, outlining their respective meanings.
DR_NO: Division of Records Number: Official file number made up of a 2 digit year, area ID, and 5 digits.
Date Rptd: The reported date.
DATE OCC: The crime occurrence date.
TIME OCC: The occurrence time.
AREA: The LAPD has 21 Community Police Stations referred to as Geographic Areas within the department. These Geographic Areas are sequentially numbered from 1-21.
AREA NAME: The 21 Geographic Areas or Patrol Divisions are also given a name designation that references a landmark or the surrounding community that it is responsible for. For example 77th Street Division is located at the intersection of South Broadway and 77th Street, serving neighborhoods in South Los Angeles.
Rpt Dist No: A four-digit code that represents a sub-area within a Geographic Area. All crime records reference the “RD” that it occurred in for statistical comparisons. Find LAPD Reporting Districts on the LA City GeoHub at http://geohub.lacity.org/datasets/c4f83909b81d4786aa8ba8a74.
Part 1-2: Indicates whether the offense is classified as a Part I (more serious) or Part II crime.
Crm Cd: Indicates the crime committed. (Same as Crime Code 1).
Crm Cd Desc: Defines the Crime Code provided.
Mocodes: Modus Operandi: Activities associated with the suspect in commission of the crime. See the attached PDF for the list of MO Codes in numerical order. https://data.lacity.org/api/views/y8tr-7khq/files/3a967fbd-f210-4857-bc52-60230efe256c?download=true&filename=MO CODES (numerical%20order.
Vict Age: The victim age.
Vict Sex: Gender: F - Female, M - Male, X - Unknown.
Vict Descent: Descent Code: A - Other Asian B - Black C - Chinese D - Cambodian F - Filipino G - Guamanian H - Hispanic/Latin/Mexican I - American Indian/Alaskan Native J - Japanese K - Korean L - Laotian O - Other P - Pacific Islander S - Samoan U - Hawaiian V - Vietnamese W - White X - Unknown Z - Asian Indian.
Premis Cd: The type of structure, vehicle, or location where the crime took place.
Premis Desc: Defines the Premise Code provided.
Weapon Used Cd: The type of weapon used in the crime.
Weapon Desc: Defines the Weapon Used Code provided.
Status: Status of the case. (IC is the default).
Status Desc: Defines the Status Code provided.
Crm Cd 1: Indicates the crime committed. Crime Code 1 is the primary and most serious one. Crime Code 2, 3, and 4 are respectively less serious offenses. Lower crime class numbers are more serious.
Crm Cd 2: May contain a code for an additional crime, less serious than Crime Code 1.
Crm Cd 3: May contain a code for an additional crime, less serious than Crime Code 1.
Crm Cd 4: May contain a code for an additional crime, less serious than Crime Code 1.
LOCATION: Street address of crime incident rounded to the nearest hundred block to maintain anonymity.
Cross Street: Cross Street of rounded Address.
LAT: Latitude.
LON: Longitude.
🤔Assumption#
Before we begin exploring our data, let’s establish an initial assumption; afterward, we’ll verify its accuracy through data exploration. Assumption: there may be more female victims than male victims.
💻Explore the Data#
We’ll import the dataset and look at a few rows to get an idea of what information we have.
import pandas as pd
df = pd.read_csv("Crime_Data.csv")
# Let's get some insight about our data
df.head()
| DR_NO | Date Rptd | DATE OCC | TIME OCC | AREA | AREA NAME | Rpt Dist No | Part 1-2 | Crm Cd | Crm Cd Desc | ... | Status | Status Desc | Crm Cd 1 | Crm Cd 2 | Crm Cd 3 | Crm Cd 4 | LOCATION | Cross Street | LAT | LON | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10304468 | 01/08/2020 12:00:00 AM | 01/08/2020 12:00:00 AM | 2230 | 3 | Southwest | 377 | 2 | 624 | BATTERY - SIMPLE ASSAULT | ... | AO | Adult Other | 624.0 | NaN | NaN | NaN | 1100 W 39TH PL | NaN | 34.0141 | -118.2978 |
| 1 | 190101086 | 01/02/2020 12:00:00 AM | 01/01/2020 12:00:00 AM | 330 | 1 | Central | 163 | 2 | 624 | BATTERY - SIMPLE ASSAULT | ... | IC | Invest Cont | 624.0 | NaN | NaN | NaN | 700 S HILL ST | NaN | 34.0459 | -118.2545 |
| 2 | 201220752 | 09/16/2020 12:00:00 AM | 09/16/2020 12:00:00 AM | 1230 | 12 | 77th Street | 1259 | 2 | 745 | VANDALISM - MISDEAMEANOR ($399 OR UNDER) | ... | IC | Invest Cont | 745.0 | NaN | NaN | NaN | 700 E 73RD ST | NaN | 33.9739 | -118.2630 |
| 3 | 191501505 | 01/01/2020 12:00:00 AM | 01/01/2020 12:00:00 AM | 1730 | 15 | N Hollywood | 1543 | 2 | 745 | VANDALISM - MISDEAMEANOR ($399 OR UNDER) | ... | IC | Invest Cont | 745.0 | 998.0 | NaN | NaN | 5400 CORTEEN PL | NaN | 34.1685 | -118.4019 |
| 4 | 191921269 | 01/01/2020 12:00:00 AM | 01/01/2020 12:00:00 AM | 415 | 19 | Mission | 1998 | 2 | 740 | VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA... | ... | IC | Invest Cont | 740.0 | NaN | NaN | NaN | 14400 TITUS ST | NaN | 34.2198 | -118.4468 |
5 rows × 28 columns
What I’ll do next is convert TIME OCC into a real 24-hour time format instead of integers. The resources I used are attached at the end of the project.
df["TIME OCC"] = df["TIME OCC"].apply(lambda time: str(time).zfill(4)[:2] + ':' + str(time).zfill(4)[2:])
df
| DR_NO | Date Rptd | DATE OCC | TIME OCC | AREA | AREA NAME | Rpt Dist No | Part 1-2 | Crm Cd | Crm Cd Desc | ... | Status | Status Desc | Crm Cd 1 | Crm Cd 2 | Crm Cd 3 | Crm Cd 4 | LOCATION | Cross Street | LAT | LON | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10304468 | 01/08/2020 12:00:00 AM | 01/08/2020 12:00:00 AM | 22:30 | 3 | Southwest | 377 | 2 | 624 | BATTERY - SIMPLE ASSAULT | ... | AO | Adult Other | 624.0 | NaN | NaN | NaN | 1100 W 39TH PL | NaN | 34.0141 | -118.2978 |
| 1 | 190101086 | 01/02/2020 12:00:00 AM | 01/01/2020 12:00:00 AM | 03:30 | 1 | Central | 163 | 2 | 624 | BATTERY - SIMPLE ASSAULT | ... | IC | Invest Cont | 624.0 | NaN | NaN | NaN | 700 S HILL ST | NaN | 34.0459 | -118.2545 |
| 2 | 201220752 | 09/16/2020 12:00:00 AM | 09/16/2020 12:00:00 AM | 12:30 | 12 | 77th Street | 1259 | 2 | 745 | VANDALISM - MISDEAMEANOR ($399 OR UNDER) | ... | IC | Invest Cont | 745.0 | NaN | NaN | NaN | 700 E 73RD ST | NaN | 33.9739 | -118.2630 |
| 3 | 191501505 | 01/01/2020 12:00:00 AM | 01/01/2020 12:00:00 AM | 17:30 | 15 | N Hollywood | 1543 | 2 | 745 | VANDALISM - MISDEAMEANOR ($399 OR UNDER) | ... | IC | Invest Cont | 745.0 | 998.0 | NaN | NaN | 5400 CORTEEN PL | NaN | 34.1685 | -118.4019 |
| 4 | 191921269 | 01/01/2020 12:00:00 AM | 01/01/2020 12:00:00 AM | 04:15 | 19 | Mission | 1998 | 2 | 740 | VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA... | ... | IC | Invest Cont | 740.0 | NaN | NaN | NaN | 14400 TITUS ST | NaN | 34.2198 | -118.4468 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 317849 | 211208872 | 03/19/2021 12:00:00 AM | 03/19/2021 12:00:00 AM | 11:05 | 12 | 77th Street | 1218 | 1 | 510 | VEHICLE - STOLEN | ... | IC | Invest Cont | 510.0 | NaN | NaN | NaN | 58TH ST | FIGUEROA ST | 33.9897 | -118.2827 |
| 317850 | 210506531 | 03/04/2021 12:00:00 AM | 03/04/2021 12:00:00 AM | 22:10 | 5 | Harbor | 564 | 2 | 434 | FALSE IMPRISONMENT | ... | AA | Adult Arrest | 434.0 | NaN | NaN | NaN | 200 W 2ND ST | NaN | 33.7424 | -118.2814 |
| 317851 | 211710505 | 07/09/2021 12:00:00 AM | 07/09/2021 12:00:00 AM | 10:50 | 17 | Devonshire | 1798 | 2 | 624 | BATTERY - SIMPLE ASSAULT | ... | IC | Invest Cont | 624.0 | NaN | NaN | NaN | 8800 DEMPSEY AV | NaN | 34.2302 | -118.4775 |
| 317852 | 210312887 | 07/12/2021 12:00:00 AM | 07/12/2021 12:00:00 AM | 12:00 | 3 | Southwest | 363 | 1 | 350 | THEFT, PERSON | ... | IC | Invest Cont | 350.0 | NaN | NaN | NaN | CRENSHAW BL | STOCKER ST | 34.0088 | -118.3351 |
| 317853 | 212005847 | 02/22/2021 12:00:00 AM | 02/22/2021 12:00:00 AM | 12:00 | 20 | Olympic | 2034 | 1 | 510 | VEHICLE - STOLEN | ... | IC | Invest Cont | 510.0 | NaN | NaN | NaN | 3300 W 8TH ST | NaN | 34.0596 | -118.3022 |
317854 rows × 28 columns
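To make the conversion concrete, here is a small sketch of what the zfill-based formatting does on a few representative integer times (these values are illustrative, not rows from the dataset):

```python
# TIME OCC is stored as an integer such as 330 (03:30) or 2230 (22:30).
# zfill(4) left-pads with zeros so every value has four digits, and the
# slice inserts a colon between the hours and the minutes.
def to_hhmm(time):
    s = str(time).zfill(4)
    return s[:2] + ":" + s[2:]

print(to_hhmm(330))   # 03:30
print(to_hhmm(2230))  # 22:30
print(to_hhmm(5))     # 00:05
```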
Then we’ll create new columns containing the month and hour. Before creating them, we need to convert DATE OCC and TIME OCC to datetime type; then we can use the .dt accessor to extract the month and hour.
# convert DATE OCC and TIME OCC to datetime, then add Month and Hour columns
df["DATE OCC"] = pd.to_datetime(df["DATE OCC"])
df["Month"] = df["DATE OCC"].dt.month
df["TIME OCC"] = pd.to_datetime(df["TIME OCC"])
df["Hour"] = df["TIME OCC"].dt.hour
df
| DR_NO | Date Rptd | DATE OCC | TIME OCC | AREA | AREA NAME | Rpt Dist No | Part 1-2 | Crm Cd | Crm Cd Desc | ... | Crm Cd 1 | Crm Cd 2 | Crm Cd 3 | Crm Cd 4 | LOCATION | Cross Street | LAT | LON | Month | Hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10304468 | 01/08/2020 12:00:00 AM | 2020-01-08 | 2023-09-27 22:30:00 | 3 | Southwest | 377 | 2 | 624 | BATTERY - SIMPLE ASSAULT | ... | 624.0 | NaN | NaN | NaN | 1100 W 39TH PL | NaN | 34.0141 | -118.2978 | 1 | 22 |
| 1 | 190101086 | 01/02/2020 12:00:00 AM | 2020-01-01 | 2023-09-27 03:30:00 | 1 | Central | 163 | 2 | 624 | BATTERY - SIMPLE ASSAULT | ... | 624.0 | NaN | NaN | NaN | 700 S HILL ST | NaN | 34.0459 | -118.2545 | 1 | 3 |
| 2 | 201220752 | 09/16/2020 12:00:00 AM | 2020-09-16 | 2023-09-27 12:30:00 | 12 | 77th Street | 1259 | 2 | 745 | VANDALISM - MISDEAMEANOR ($399 OR UNDER) | ... | 745.0 | NaN | NaN | NaN | 700 E 73RD ST | NaN | 33.9739 | -118.2630 | 9 | 12 |
| 3 | 191501505 | 01/01/2020 12:00:00 AM | 2020-01-01 | 2023-09-27 17:30:00 | 15 | N Hollywood | 1543 | 2 | 745 | VANDALISM - MISDEAMEANOR ($399 OR UNDER) | ... | 745.0 | 998.0 | NaN | NaN | 5400 CORTEEN PL | NaN | 34.1685 | -118.4019 | 1 | 17 |
| 4 | 191921269 | 01/01/2020 12:00:00 AM | 2020-01-01 | 2023-09-27 04:15:00 | 19 | Mission | 1998 | 2 | 740 | VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA... | ... | 740.0 | NaN | NaN | NaN | 14400 TITUS ST | NaN | 34.2198 | -118.4468 | 1 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 317849 | 211208872 | 03/19/2021 12:00:00 AM | 2021-03-19 | 2023-09-27 11:05:00 | 12 | 77th Street | 1218 | 1 | 510 | VEHICLE - STOLEN | ... | 510.0 | NaN | NaN | NaN | 58TH ST | FIGUEROA ST | 33.9897 | -118.2827 | 3 | 11 |
| 317850 | 210506531 | 03/04/2021 12:00:00 AM | 2021-03-04 | 2023-09-27 22:10:00 | 5 | Harbor | 564 | 2 | 434 | FALSE IMPRISONMENT | ... | 434.0 | NaN | NaN | NaN | 200 W 2ND ST | NaN | 33.7424 | -118.2814 | 3 | 22 |
| 317851 | 211710505 | 07/09/2021 12:00:00 AM | 2021-07-09 | 2023-09-27 10:50:00 | 17 | Devonshire | 1798 | 2 | 624 | BATTERY - SIMPLE ASSAULT | ... | 624.0 | NaN | NaN | NaN | 8800 DEMPSEY AV | NaN | 34.2302 | -118.4775 | 7 | 10 |
| 317852 | 210312887 | 07/12/2021 12:00:00 AM | 2021-07-12 | 2023-09-27 12:00:00 | 3 | Southwest | 363 | 1 | 350 | THEFT, PERSON | ... | 350.0 | NaN | NaN | NaN | CRENSHAW BL | STOCKER ST | 34.0088 | -118.3351 | 7 | 12 |
| 317853 | 212005847 | 02/22/2021 12:00:00 AM | 2021-02-22 | 2023-09-27 12:00:00 | 20 | Olympic | 2034 | 1 | 510 | VEHICLE - STOLEN | ... | 510.0 | NaN | NaN | NaN | 3300 W 8TH ST | NaN | 34.0596 | -118.3022 | 2 | 12 |
317854 rows × 30 columns
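One subtlety: pd.to_datetime applied to a time-only string like "22:30" attaches the current date, which is why the TIME OCC column above shows a 2023 date. Only the hour matters for our purposes, as this minimal check shows:

```python
import pandas as pd

# Parsing a time-only string yields a Timestamp on today's date;
# the hour and minute are still extracted correctly.
t = pd.to_datetime("22:30")
print(t.hour)    # 22
print(t.minute)  # 30
```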
We can see that the Vict Age column contains values of 0, which are not useful for our analysis, so we’ll drop those rows. (The filter below removes every row containing a 0 in any column, which covers the unknown ages.)
# drop rows containing a 0 in any column
df = df[~df.isin([0]).any(axis=1)]
df
| DR_NO | Date Rptd | DATE OCC | TIME OCC | AREA | AREA NAME | Rpt Dist No | Part 1-2 | Crm Cd | Crm Cd Desc | ... | Crm Cd 1 | Crm Cd 2 | Crm Cd 3 | Crm Cd 4 | LOCATION | Cross Street | LAT | LON | Month | Hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10304468 | 01/08/2020 12:00:00 AM | 2020-01-08 | 2023-09-27 22:30:00 | 3 | Southwest | 377 | 2 | 624 | BATTERY - SIMPLE ASSAULT | ... | 624.0 | NaN | NaN | NaN | 1100 W 39TH PL | NaN | 34.0141 | -118.2978 | 1 | 22 |
| 1 | 190101086 | 01/02/2020 12:00:00 AM | 2020-01-01 | 2023-09-27 03:30:00 | 1 | Central | 163 | 2 | 624 | BATTERY - SIMPLE ASSAULT | ... | 624.0 | NaN | NaN | NaN | 700 S HILL ST | NaN | 34.0459 | -118.2545 | 1 | 3 |
| 2 | 201220752 | 09/16/2020 12:00:00 AM | 2020-09-16 | 2023-09-27 12:30:00 | 12 | 77th Street | 1259 | 2 | 745 | VANDALISM - MISDEAMEANOR ($399 OR UNDER) | ... | 745.0 | NaN | NaN | NaN | 700 E 73RD ST | NaN | 33.9739 | -118.2630 | 9 | 12 |
| 3 | 191501505 | 01/01/2020 12:00:00 AM | 2020-01-01 | 2023-09-27 17:30:00 | 15 | N Hollywood | 1543 | 2 | 745 | VANDALISM - MISDEAMEANOR ($399 OR UNDER) | ... | 745.0 | 998.0 | NaN | NaN | 5400 CORTEEN PL | NaN | 34.1685 | -118.4019 | 1 | 17 |
| 4 | 191921269 | 01/01/2020 12:00:00 AM | 2020-01-01 | 2023-09-27 04:15:00 | 19 | Mission | 1998 | 2 | 740 | VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA... | ... | 740.0 | NaN | NaN | NaN | 14400 TITUS ST | NaN | 34.2198 | -118.4468 | 1 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 317847 | 212000771 | 05/28/2021 12:00:00 AM | 2021-05-28 | 2023-09-27 19:30:00 | 20 | Olympic | 2056 | 2 | 888 | TRESPASSING | ... | 888.0 | 998.0 | NaN | NaN | 900 S BERENDO ST | NaN | 34.0527 | -118.2937 | 5 | 19 |
| 317848 | 212110947 | 07/04/2021 12:00:00 AM | 2021-07-04 | 2023-09-27 21:35:00 | 21 | Topanga | 2143 | 2 | 624 | BATTERY - SIMPLE ASSAULT | ... | 624.0 | NaN | NaN | NaN | 23100 FRIAR ST | NaN | 34.1855 | -118.6296 | 7 | 21 |
| 317850 | 210506531 | 03/04/2021 12:00:00 AM | 2021-03-04 | 2023-09-27 22:10:00 | 5 | Harbor | 564 | 2 | 434 | FALSE IMPRISONMENT | ... | 434.0 | NaN | NaN | NaN | 200 W 2ND ST | NaN | 33.7424 | -118.2814 | 3 | 22 |
| 317851 | 211710505 | 07/09/2021 12:00:00 AM | 2021-07-09 | 2023-09-27 10:50:00 | 17 | Devonshire | 1798 | 2 | 624 | BATTERY - SIMPLE ASSAULT | ... | 624.0 | NaN | NaN | NaN | 8800 DEMPSEY AV | NaN | 34.2302 | -118.4775 | 7 | 10 |
| 317852 | 210312887 | 07/12/2021 12:00:00 AM | 2021-07-12 | 2023-09-27 12:00:00 | 3 | Southwest | 363 | 1 | 350 | THEFT, PERSON | ... | 350.0 | NaN | NaN | NaN | CRENSHAW BL | STOCKER ST | 34.0088 | -118.3351 | 7 | 12 |
229379 rows × 30 columns
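Note that the filter above removes rows containing a 0 in any column, not just rows with an unknown age; a toy frame (hypothetical values) contrasts it with a narrower age-only filter:

```python
import pandas as pd

# Toy frame with made-up values to contrast the two filters.
toy = pd.DataFrame({"Vict Age": [25, 0, 40], "Crm Cd 2": [0, 998, 0]})

# The broad filter used above drops rows with a 0 in ANY column.
broad = toy[~toy.isin([0]).any(axis=1)]
# A narrower filter that only targets unknown ages.
narrow = toy[toy["Vict Age"] != 0]

print(len(broad))   # 0  (every toy row contains some 0)
print(len(narrow))  # 2  (only the age-0 row is dropped)
```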
I’ll select several different columns to create a new dataframe, and drop any missing values.
df_sub = df[["DATE OCC","Vict Sex","AREA NAME","Vict Age","Vict Descent","Hour","Month"]].dropna().copy()
df_sub
| DATE OCC | Vict Sex | AREA NAME | Vict Age | Vict Descent | Hour | Month | |
|---|---|---|---|---|---|---|---|
| 0 | 2020-01-08 | F | Southwest | 36 | B | 22 | 1 |
| 1 | 2020-01-01 | M | Central | 25 | H | 3 | 1 |
| 2 | 2020-09-16 | M | 77th Street | 62 | B | 12 | 9 |
| 3 | 2020-01-01 | F | N Hollywood | 76 | W | 17 | 1 |
| 4 | 2020-01-01 | X | Mission | 31 | X | 4 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 317847 | 2021-05-28 | M | Olympic | 29 | H | 19 | 5 |
| 317848 | 2021-07-04 | M | Topanga | 44 | W | 21 | 7 |
| 317850 | 2021-03-04 | F | Harbor | 41 | B | 22 | 3 |
| 317851 | 2021-07-09 | M | Devonshire | 40 | H | 10 | 7 |
| 317852 | 2021-07-12 | F | Southwest | 15 | H | 12 | 7 |
229370 rows × 7 columns
📈 Plotting Chart#
Then we’ll plot several different charts. Before we do that, we need to import Altair.
import altair as alt
Since we have a large dataset, which exceeds the number of rows Altair accepts by default (5,000), we need to randomly sample our data, fixing random_state so that the result does not change every time.
df_pre = df_sub.sample(5000,random_state=50)
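As a quick sketch of why fixing random_state keeps the result stable, sampling a toy frame twice with the same seed returns exactly the same rows:

```python
import pandas as pd

# Sampling with a fixed random_state is reproducible: the same rows
# come back every time the notebook is rerun.
toy = pd.DataFrame({"x": range(100)})
s1 = toy.sample(10, random_state=50)
s2 = toy.sample(10, random_state=50)
print(s1.index.equals(s2.index))  # True
```

(An alternative, if the full dataset were needed, would be to lift Altair’s row limit with `alt.data_transformers.disable_max_rows()`, at the cost of larger notebooks.)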
The chart below shows the total number of victims per month.
total_victim_chart = alt.Chart(df_pre).mark_bar(size = 20).encode(
x = alt.X("Month"),
y = alt.Y("count(Vict Sex)"),
color = alt.Color('Month', scale=alt.Scale(scheme='redpurple')),
tooltip = ["count(Vict Age)"]
).properties(title = "Total Victims Per Month")
total_victim_chart
From the chart above, we can see that the number of victims from January to July is greater than from August to September.
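The counts behind the bar chart can also be cross-checked directly in pandas; a sketch on a toy frame with hypothetical months:

```python
import pandas as pd

# Counting victims per month without a chart, on a toy frame.
toy = pd.DataFrame({"Month": [1, 1, 2, 7, 7, 7]})
per_month = toy["Month"].value_counts().sort_index()
print(per_month.to_dict())  # {1: 2, 2: 1, 7: 3}
```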
We’ll keep exploring the data to see the number of female and male victims in each month. To compare them, we need to separate our data into two sub-dataframes, one containing only female victims and another containing only male victims, then plot a chart.
df_sub1 = df_pre[df_pre["Vict Sex"].str.contains("F")]
df_sub1
| DATE OCC | Vict Sex | AREA NAME | Vict Age | Vict Descent | Hour | Month | |
|---|---|---|---|---|---|---|---|
| 265084 | 2021-03-09 | F | West LA | 19 | A | 19 | 3 |
| 128226 | 2020-08-09 | F | Pacific | 31 | W | 19 | 8 |
| 88874 | 2020-08-21 | F | Mission | 57 | O | 11 | 8 |
| 189547 | 2020-12-19 | F | Southwest | 28 | B | 8 | 12 |
| 143058 | 2020-06-15 | F | Van Nuys | 5 | B | 12 | 6 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 231567 | 2021-01-19 | F | West LA | 29 | W | 20 | 1 |
| 251113 | 2021-02-10 | F | Van Nuys | 31 | H | 8 | 2 |
| 271580 | 2021-01-07 | F | 77th Street | 34 | H | 8 | 1 |
| 266620 | 2021-07-24 | F | N Hollywood | 36 | W | 6 | 7 |
| 195530 | 2020-06-22 | F | West Valley | 16 | H | 17 | 6 |
2395 rows × 7 columns
df_sub2 = df_pre[df_pre["Vict Sex"].str.contains("M")]
df_sub2
| DATE OCC | Vict Sex | AREA NAME | Vict Age | Vict Descent | Hour | Month | |
|---|---|---|---|---|---|---|---|
| 297056 | 2021-02-14 | M | Olympic | 54 | H | 12 | 2 |
| 314052 | 2021-02-02 | M | Devonshire | 43 | B | 11 | 2 |
| 135737 | 2020-08-30 | M | Rampart | 26 | W | 17 | 8 |
| 231820 | 2021-04-12 | M | Devonshire | 32 | B | 16 | 4 |
| 171612 | 2020-12-07 | M | Olympic | 25 | H | 8 | 12 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 258142 | 2021-04-27 | M | Southeast | 39 | B | 15 | 4 |
| 28544 | 2020-05-12 | M | 77th Street | 56 | H | 23 | 5 |
| 63843 | 2020-09-22 | M | West Valley | 33 | O | 9 | 9 |
| 79708 | 2020-04-29 | M | Wilshire | 30 | C | 12 | 4 |
| 189307 | 2020-12-26 | M | West Valley | 53 | A | 21 | 12 |
2559 rows × 7 columns
female_victim = alt.Chart(df_sub1).mark_bar(size = 20).encode(
x= "Month",
y= "count(Vict Sex)",
color = alt.Color('Month', scale=alt.Scale(scheme='blues')),
tooltip = ["count(Vict Sex)"]
).properties(title = "Total Female Victims Per Month")
male_victim = alt.Chart(df_sub2).mark_line(color="red").encode(
x= "Month",
y= "count(Vict Sex)",
tooltip = ["count(Vict Sex)"]
).properties(title = "Total Male Victims Per Month")
female_victim | male_victim
(female_victim + male_victim).properties(title = "Female Victims Versus Male Victims")
From the graph above, we can see that our assumption is false: in the randomly sampled data, there are more male victims than female victims.
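The same comparison can be made numerically with value_counts rather than visually; a sketch on a toy frame (hypothetical rows):

```python
import pandas as pd

# Comparing female and male counts numerically instead of by chart.
toy = pd.DataFrame({"Vict Sex": ["F", "M", "M", "F", "M", "X"]})
counts = toy["Vict Sex"].value_counts()
print(counts["M"] > counts["F"])  # True in this toy frame
```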
We can also look at the Victims’ Descent.
descent_victim = alt.Chart(df_pre).mark_bar(size = 20).encode(
x= "Vict Descent",
y= "count(Vict Descent)",
color = "count(Vict Descent)",
tooltip = ["count(Vict Descent)"]
).properties(title = "Total Number of Different Descent Victims")
descent_victim
We can see that there are more Hispanic/Latin/Mexican victims than victims of any other descent.
Let’s look at the median age of victims in different areas.
df_area = df_pre.groupby("AREA NAME")
def make_chart(df_pre):
chart = alt.Chart(df_pre).mark_line().encode(
x="Month:N",
y="median(Vict Age)",
tooltip=["Month","median(Vict Age)"]
)
return chart+chart.mark_circle()
make_chart(df_area.get_group("Central"))
chart_list0 = [make_chart(sub_df) for ind_name,sub_df in df_area]
chart_list = [make_chart(sub_df).properties(title=f"Area Name: {ind_name}") for ind_name,sub_df in df_area]
alt.vconcat(*chart_list)
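The medians plotted per area can also be computed directly with a groupby; a sketch with two hypothetical areas:

```python
import pandas as pd

# The medians shown in the per-area charts, computed without a chart;
# toy values for two made-up areas.
toy = pd.DataFrame({
    "AREA NAME": ["Central", "Central", "Mission"],
    "Vict Age": [20, 40, 35],
})
medians = toy.groupby("AREA NAME")["Vict Age"].median()
print(medians.to_dict())  # {'Central': 30.0, 'Mission': 35.0}
```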
🗂Train our Dataset#
What I’ll do next is train our data, using a decision tree and k-nearest neighbors to see which one has the higher prediction accuracy. We will predict Vict Sex based on Vict Age, Hour, Month, and Year.
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
First, we’ll add a new column named Year to our dataframe.
df_pre["Year"] = df_sub["DATE OCC"].dt.year
df_pre
| DATE OCC | Vict Sex | AREA NAME | Vict Age | Vict Descent | Hour | Month | Year | |
|---|---|---|---|---|---|---|---|---|
| 297056 | 2021-02-14 | M | Olympic | 54 | H | 12 | 2 | 2021 |
| 314052 | 2021-02-02 | M | Devonshire | 43 | B | 11 | 2 | 2021 |
| 265084 | 2021-03-09 | F | West LA | 19 | A | 19 | 3 | 2021 |
| 128226 | 2020-08-09 | F | Pacific | 31 | W | 19 | 8 | 2020 |
| 88874 | 2020-08-21 | F | Mission | 57 | O | 11 | 8 | 2020 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 271580 | 2021-01-07 | F | 77th Street | 34 | H | 8 | 1 | 2021 |
| 79708 | 2020-04-29 | M | Wilshire | 30 | C | 12 | 4 | 2020 |
| 266620 | 2021-07-24 | F | N Hollywood | 36 | W | 6 | 7 | 2021 |
| 189307 | 2020-12-26 | M | West Valley | 53 | A | 21 | 12 | 2020 |
| 195530 | 2020-06-22 | F | West Valley | 16 | H | 17 | 6 | 2020 |
5000 rows × 8 columns
Then we’ll drop the X values in the Vict Sex column.
df_pre = df_pre[df_pre["Vict Sex"].str.contains("M") | df_pre["Vict Sex"].str.contains("F")]
df_pre
| DATE OCC | Vict Sex | AREA NAME | Vict Age | Vict Descent | Hour | Month | Year | |
|---|---|---|---|---|---|---|---|---|
| 297056 | 2021-02-14 | M | Olympic | 54 | H | 12 | 2 | 2021 |
| 314052 | 2021-02-02 | M | Devonshire | 43 | B | 11 | 2 | 2021 |
| 265084 | 2021-03-09 | F | West LA | 19 | A | 19 | 3 | 2021 |
| 128226 | 2020-08-09 | F | Pacific | 31 | W | 19 | 8 | 2020 |
| 88874 | 2020-08-21 | F | Mission | 57 | O | 11 | 8 | 2020 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 271580 | 2021-01-07 | F | 77th Street | 34 | H | 8 | 1 | 2021 |
| 79708 | 2020-04-29 | M | Wilshire | 30 | C | 12 | 4 | 2020 |
| 266620 | 2021-07-24 | F | N Hollywood | 36 | W | 6 | 7 | 2021 |
| 189307 | 2020-12-26 | M | West Valley | 53 | A | 21 | 12 | 2020 |
| 195530 | 2020-06-22 | F | West Valley | 16 | H | 17 | 6 | 2020 |
4954 rows × 8 columns
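An equivalent and slightly more idiomatic way to keep only the M and F rows is isin; a sketch on a toy column:

```python
import pandas as pd

# Keep only the rows whose Vict Sex is exactly "M" or "F".
toy = pd.DataFrame({"Vict Sex": ["M", "F", "X", "H"]})
kept = toy[toy["Vict Sex"].isin(["M", "F"])]
print(list(kept["Vict Sex"]))  # ['M', 'F']
```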
crime = ["Vict Age","Hour","Month","Year"]
Here we split our data into training and test sets, using Vict Age, Hour, Month, and Year to predict Vict Sex.
X = df_pre[crime]
y = df_pre['Vict Sex']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)
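As a hedged sketch of the fitting step, here is how a DecisionTreeClassifier could be fit and scored on such a split. The data below is synthetic stand-in data with the same four-feature shape (age, hour, month, year); all values are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (Vict Age, Hour, Month, Year) features.
rng = np.random.default_rng(50)
X_toy = np.column_stack([
    rng.integers(1, 90, 200),       # age
    rng.integers(0, 24, 200),       # hour
    rng.integers(1, 13, 200),       # month
    rng.integers(2020, 2022, 200),  # year
])
y_toy = rng.choice(["F", "M"], size=200)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=50)

# Fit a shallow tree and report accuracy on the held-out split.
clf = DecisionTreeClassifier(max_depth=5, random_state=50)
clf.fit(Xtr, ytr)
print("test accuracy:", clf.score(Xte, yte))
```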
Here we use the decision tree, and plot the graph below.
# See what the DecisionTreeClassifier contains
help(DecisionTreeClassifier)
|
| check_input : bool, default=True
| Allow to bypass several input checking.
| Don't use this parameter unless you know what you do.
|
| X_idx_sorted : deprecated, default="deprecated"
| This parameter is deprecated and has no effect.
| It will be removed in 1.1 (renaming of 0.26).
|
| .. deprecated:: 0.24
|
| Returns
| -------
| self : DecisionTreeClassifier
| Fitted estimator.
|
| predict_log_proba(self, X)
| Predict class log-probabilities of the input samples X.
|
| Parameters
| ----------
| X : {array-like, sparse matrix} of shape (n_samples, n_features)
| The input samples. Internally, it will be converted to
| ``dtype=np.float32`` and if a sparse matrix is provided
| to a sparse ``csr_matrix``.
|
| Returns
| -------
| proba : ndarray of shape (n_samples, n_classes) or list of n_outputs such arrays if n_outputs > 1
| The class log-probabilities of the input samples. The order of the
| classes corresponds to that in the attribute :term:`classes_`.
|
| predict_proba(self, X, check_input=True)
| Predict class probabilities of the input samples X.
|
| The predicted class probability is the fraction of samples of the same
| class in a leaf.
|
| Parameters
| ----------
| X : {array-like, sparse matrix} of shape (n_samples, n_features)
| The input samples. Internally, it will be converted to
| ``dtype=np.float32`` and if a sparse matrix is provided
| to a sparse ``csr_matrix``.
|
| check_input : bool, default=True
| Allow to bypass several input checking.
| Don't use this parameter unless you know what you do.
|
| Returns
| -------
| proba : ndarray of shape (n_samples, n_classes) or list of n_outputs such arrays if n_outputs > 1
| The class probabilities of the input samples. The order of the
| classes corresponds to that in the attribute :term:`classes_`.
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| n_features_
| DEPRECATED: The attribute `n_features_` is deprecated in 1.0 and will be removed in 1.2. Use `n_features_in_` instead.
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| __abstractmethods__ = frozenset()
|
| ----------------------------------------------------------------------
| Methods inherited from sklearn.base.ClassifierMixin:
|
| score(self, X, y, sample_weight=None)
| Return the mean accuracy on the given test data and labels.
|
| In multi-label classification, this is the subset accuracy
| which is a harsh metric since you require for each sample that
| each label set be correctly predicted.
|
| Parameters
| ----------
| X : array-like of shape (n_samples, n_features)
| Test samples.
|
| y : array-like of shape (n_samples,) or (n_samples, n_outputs)
| True labels for `X`.
|
| sample_weight : array-like of shape (n_samples,), default=None
| Sample weights.
|
| Returns
| -------
| score : float
| Mean accuracy of ``self.predict(X)`` wrt. `y`.
|
| ----------------------------------------------------------------------
| Data descriptors inherited from sklearn.base.ClassifierMixin:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| ----------------------------------------------------------------------
| Methods inherited from BaseDecisionTree:
|
| apply(self, X, check_input=True)
| Return the index of the leaf that each sample is predicted as.
|
| .. versionadded:: 0.17
|
| Parameters
| ----------
| X : {array-like, sparse matrix} of shape (n_samples, n_features)
| The input samples. Internally, it will be converted to
| ``dtype=np.float32`` and if a sparse matrix is provided
| to a sparse ``csr_matrix``.
|
| check_input : bool, default=True
| Allow to bypass several input checking.
| Don't use this parameter unless you know what you do.
|
| Returns
| -------
| X_leaves : array-like of shape (n_samples,)
| For each datapoint x in X, return the index of the leaf x
| ends up in. Leaves are numbered within
| ``[0; self.tree_.node_count)``, possibly with gaps in the
| numbering.
|
| cost_complexity_pruning_path(self, X, y, sample_weight=None)
| Compute the pruning path during Minimal Cost-Complexity Pruning.
|
| See :ref:`minimal_cost_complexity_pruning` for details on the pruning
| process.
|
| Parameters
| ----------
| X : {array-like, sparse matrix} of shape (n_samples, n_features)
| The training input samples. Internally, it will be converted to
| ``dtype=np.float32`` and if a sparse matrix is provided
| to a sparse ``csc_matrix``.
|
| y : array-like of shape (n_samples,) or (n_samples, n_outputs)
| The target values (class labels) as integers or strings.
|
| sample_weight : array-like of shape (n_samples,), default=None
| Sample weights. If None, then samples are equally weighted. Splits
| that would create child nodes with net zero or negative weight are
| ignored while searching for a split in each node. Splits are also
| ignored if they would result in any single class carrying a
| negative weight in either child node.
|
| Returns
| -------
| ccp_path : :class:`~sklearn.utils.Bunch`
| Dictionary-like object, with the following attributes.
|
| ccp_alphas : ndarray
| Effective alphas of subtree during pruning.
|
| impurities : ndarray
| Sum of the impurities of the subtree leaves for the
| corresponding alpha value in ``ccp_alphas``.
|
| decision_path(self, X, check_input=True)
| Return the decision path in the tree.
|
| .. versionadded:: 0.18
|
| Parameters
| ----------
| X : {array-like, sparse matrix} of shape (n_samples, n_features)
| The input samples. Internally, it will be converted to
| ``dtype=np.float32`` and if a sparse matrix is provided
| to a sparse ``csr_matrix``.
|
| check_input : bool, default=True
| Allow to bypass several input checking.
| Don't use this parameter unless you know what you do.
|
| Returns
| -------
| indicator : sparse matrix of shape (n_samples, n_nodes)
| Return a node indicator CSR matrix where non zero elements
| indicates that the samples goes through the nodes.
|
| get_depth(self)
| Return the depth of the decision tree.
|
| The depth of a tree is the maximum distance between the root
| and any leaf.
|
| Returns
| -------
| self.tree_.max_depth : int
| The maximum depth of the tree.
|
| get_n_leaves(self)
| Return the number of leaves of the decision tree.
|
| Returns
| -------
| self.tree_.n_leaves : int
| Number of leaves.
|
| predict(self, X, check_input=True)
| Predict class or regression value for X.
|
| For a classification model, the predicted class for each sample in X is
| returned. For a regression model, the predicted value based on X is
| returned.
|
| Parameters
| ----------
| X : {array-like, sparse matrix} of shape (n_samples, n_features)
| The input samples. Internally, it will be converted to
| ``dtype=np.float32`` and if a sparse matrix is provided
| to a sparse ``csr_matrix``.
|
| check_input : bool, default=True
| Allow to bypass several input checking.
| Don't use this parameter unless you know what you do.
|
| Returns
| -------
| y : array-like of shape (n_samples,) or (n_samples, n_outputs)
| The predicted classes, or the predict values.
|
| ----------------------------------------------------------------------
| Data descriptors inherited from BaseDecisionTree:
|
| feature_importances_
| Return the feature importances.
|
| The importance of a feature is computed as the (normalized) total
| reduction of the criterion brought by that feature.
| It is also known as the Gini importance.
|
| Warning: impurity-based feature importances can be misleading for
| high cardinality features (many unique values). See
| :func:`sklearn.inspection.permutation_importance` as an alternative.
|
| Returns
| -------
| feature_importances_ : ndarray of shape (n_features,)
| Normalized total reduction of criteria by feature
| (Gini importance).
|
| ----------------------------------------------------------------------
| Methods inherited from sklearn.base.BaseEstimator:
|
| __getstate__(self)
|
| __repr__(self, N_CHAR_MAX=700)
| Return repr(self).
|
| __setstate__(self, state)
|
| get_params(self, deep=True)
| Get parameters for this estimator.
|
| Parameters
| ----------
| deep : bool, default=True
| If True, will return the parameters for this estimator and
| contained subobjects that are estimators.
|
| Returns
| -------
| params : dict
| Parameter names mapped to their values.
|
| set_params(self, **params)
| Set the parameters of this estimator.
|
| The method works on simple estimators as well as on nested objects
| (such as :class:`~sklearn.pipeline.Pipeline`). The latter have
| parameters of the form ``<component>__<parameter>`` so that it's
| possible to update each component of a nested object.
|
| Parameters
| ----------
| **params : dict
| Estimator parameters.
|
| Returns
| -------
| self : estimator instance
| Estimator instance.
tree = DecisionTreeClassifier(max_depth = 3)
tree.fit(X_train, y_train)
tree_predictions = tree.predict(X_test)
sex = df_pre['Vict Sex'].unique()
plt.figure(figsize=(50,25))
plot_tree(tree, filled=True, rounded=True, class_names=sex, feature_names=X.columns)
plt.show()
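Since a depth-3 tree plotted at this size can be hard to read, scikit-learn's `export_text` prints the same split rules as plain text. A minimal sketch on the built-in iris data; in the project, the fitted `tree`, `X.columns`, and the victim-sex labels would be substituted:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data: iris flowers; the project's X_train / y_train would go here
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Text dump of the learned split rules
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line shows one threshold test, so the whole decision path is readable without a figure.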
Then we print the classification report to check our accuracy.
print(classification_report(y_test, tree_predictions))
precision recall f1-score support
F 0.56 0.25 0.35 507
M 0.50 0.79 0.61 484
accuracy 0.52 991
macro avg 0.53 0.52 0.48 991
weighted avg 0.53 0.52 0.48 991
It seems the precision for predicting Female victims (56%) is a bit higher than for Male victims, but 56% is still not a satisfying precision. Let us try k-nearest neighbors!
from sklearn.neighbors import KNeighborsClassifier
help(KNeighborsClassifier)
Help on class KNeighborsClassifier in module sklearn.neighbors._classification:
class KNeighborsClassifier(sklearn.neighbors._base.KNeighborsMixin, sklearn.base.ClassifierMixin, sklearn.neighbors._base.NeighborsBase)
| KNeighborsClassifier(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)
|
| Classifier implementing the k-nearest neighbors vote.
|
| Read more in the :ref:`User Guide <classification>`.
|
| Parameters
| ----------
| n_neighbors : int, default=5
| Number of neighbors to use by default for :meth:`kneighbors` queries.
|
| weights : {'uniform', 'distance'} or callable, default='uniform'
| Weight function used in prediction. Possible values:
|
| - 'uniform' : uniform weights. All points in each neighborhood
| are weighted equally.
| - 'distance' : weight points by the inverse of their distance.
| in this case, closer neighbors of a query point will have a
| greater influence than neighbors which are further away.
| - [callable] : a user-defined function which accepts an
| array of distances, and returns an array of the same shape
| containing the weights.
|
| algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'
| Algorithm used to compute the nearest neighbors:
|
| - 'ball_tree' will use :class:`BallTree`
| - 'kd_tree' will use :class:`KDTree`
| - 'brute' will use a brute-force search.
| - 'auto' will attempt to decide the most appropriate algorithm
| based on the values passed to :meth:`fit` method.
|
| Note: fitting on sparse input will override the setting of
| this parameter, using brute force.
|
| leaf_size : int, default=30
| Leaf size passed to BallTree or KDTree. This can affect the
| speed of the construction and query, as well as the memory
| required to store the tree. The optimal value depends on the
| nature of the problem.
|
| p : int, default=2
| Power parameter for the Minkowski metric. When p = 1, this is
| equivalent to using manhattan_distance (l1), and euclidean_distance
| (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
|
| metric : str or callable, default='minkowski'
| The distance metric to use for the tree. The default metric is
| minkowski, and with p=2 is equivalent to the standard Euclidean
| metric. For a list of available metrics, see the documentation of
| :class:`~sklearn.metrics.DistanceMetric`.
| If metric is "precomputed", X is assumed to be a distance matrix and
| must be square during fit. X may be a :term:`sparse graph`,
| in which case only "nonzero" elements may be considered neighbors.
|
| metric_params : dict, default=None
| Additional keyword arguments for the metric function.
|
| n_jobs : int, default=None
| The number of parallel jobs to run for neighbors search.
| ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
| ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
| for more details.
| Doesn't affect :meth:`fit` method.
|
| Attributes
| ----------
| classes_ : array of shape (n_classes,)
| Class labels known to the classifier
|
| effective_metric_ : str or callble
| The distance metric used. It will be same as the `metric` parameter
| or a synonym of it, e.g. 'euclidean' if the `metric` parameter set to
| 'minkowski' and `p` parameter set to 2.
|
| effective_metric_params_ : dict
| Additional keyword arguments for the metric function. For most metrics
| will be same with `metric_params` parameter, but may also contain the
| `p` parameter value if the `effective_metric_` attribute is set to
| 'minkowski'.
|
| n_features_in_ : int
| Number of features seen during :term:`fit`.
|
| .. versionadded:: 0.24
|
| feature_names_in_ : ndarray of shape (`n_features_in_`,)
| Names of features seen during :term:`fit`. Defined only when `X`
| has feature names that are all strings.
|
| .. versionadded:: 1.0
|
| n_samples_fit_ : int
| Number of samples in the fitted data.
|
| outputs_2d_ : bool
| False when `y`'s shape is (n_samples, ) or (n_samples, 1) during fit
| otherwise True.
|
| See Also
| --------
| RadiusNeighborsClassifier: Classifier based on neighbors within a fixed radius.
| KNeighborsRegressor: Regression based on k-nearest neighbors.
| RadiusNeighborsRegressor: Regression based on neighbors within a fixed radius.
| NearestNeighbors: Unsupervised learner for implementing neighbor searches.
|
| Notes
| -----
| See :ref:`Nearest Neighbors <neighbors>` in the online documentation
| for a discussion of the choice of ``algorithm`` and ``leaf_size``.
|
| .. warning::
|
| Regarding the Nearest Neighbors algorithms, if it is found that two
| neighbors, neighbor `k+1` and `k`, have identical distances
| but different labels, the results will depend on the ordering of the
| training data.
|
| https://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
|
| Examples
| --------
| >>> X = [[0], [1], [2], [3]]
| >>> y = [0, 0, 1, 1]
| >>> from sklearn.neighbors import KNeighborsClassifier
| >>> neigh = KNeighborsClassifier(n_neighbors=3)
| >>> neigh.fit(X, y)
| KNeighborsClassifier(...)
| >>> print(neigh.predict([[1.1]]))
| [0]
| >>> print(neigh.predict_proba([[0.9]]))
| [[0.666... 0.333...]]
|
| Method resolution order:
| KNeighborsClassifier
| sklearn.neighbors._base.KNeighborsMixin
| sklearn.base.ClassifierMixin
| sklearn.neighbors._base.NeighborsBase
| sklearn.base.MultiOutputMixin
| sklearn.base.BaseEstimator
| builtins.object
|
| Methods defined here:
|
| __init__(self, n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)
| Initialize self. See help(type(self)) for accurate signature.
|
| fit(self, X, y)
| Fit the k-nearest neighbors classifier from the training dataset.
|
| Parameters
| ----------
| X : {array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_samples) if metric='precomputed'
| Training data.
|
| y : {array-like, sparse matrix} of shape (n_samples,) or (n_samples, n_outputs)
| Target values.
|
| Returns
| -------
| self : KNeighborsClassifier
| The fitted k-nearest neighbors classifier.
|
| predict(self, X)
| Predict the class labels for the provided data.
|
| Parameters
| ----------
| X : array-like of shape (n_queries, n_features), or (n_queries, n_indexed) if metric == 'precomputed'
| Test samples.
|
| Returns
| -------
| y : ndarray of shape (n_queries,) or (n_queries, n_outputs)
| Class labels for each data sample.
|
| predict_proba(self, X)
| Return probability estimates for the test data X.
|
| Parameters
| ----------
| X : array-like of shape (n_queries, n_features), or (n_queries, n_indexed) if metric == 'precomputed'
| Test samples.
|
| Returns
| -------
| p : ndarray of shape (n_queries, n_classes), or a list of n_outputs of such arrays if n_outputs > 1.
| The class probabilities of the input samples. Classes are ordered
| by lexicographic order.
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| __abstractmethods__ = frozenset()
|
| ----------------------------------------------------------------------
| Methods inherited from sklearn.neighbors._base.KNeighborsMixin:
|
| kneighbors(self, X=None, n_neighbors=None, return_distance=True)
| Find the K-neighbors of a point.
|
| Returns indices of and distances to the neighbors of each point.
|
| Parameters
| ----------
| X : array-like, shape (n_queries, n_features), or (n_queries, n_indexed) if metric == 'precomputed', default=None
| The query point or points.
| If not provided, neighbors of each indexed point are returned.
| In this case, the query point is not considered its own neighbor.
|
| n_neighbors : int, default=None
| Number of neighbors required for each sample. The default is the
| value passed to the constructor.
|
| return_distance : bool, default=True
| Whether or not to return the distances.
|
| Returns
| -------
| neigh_dist : ndarray of shape (n_queries, n_neighbors)
| Array representing the lengths to points, only present if
| return_distance=True.
|
| neigh_ind : ndarray of shape (n_queries, n_neighbors)
| Indices of the nearest points in the population matrix.
|
| Examples
| --------
| In the following example, we construct a NearestNeighbors
| class from an array representing our data set and ask who's
| the closest point to [1,1,1]
|
| >>> samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
| >>> from sklearn.neighbors import NearestNeighbors
| >>> neigh = NearestNeighbors(n_neighbors=1)
| >>> neigh.fit(samples)
| NearestNeighbors(n_neighbors=1)
| >>> print(neigh.kneighbors([[1., 1., 1.]]))
| (array([[0.5]]), array([[2]]))
|
| As you can see, it returns [[0.5]], and [[2]], which means that the
| element is at distance 0.5 and is the third element of samples
| (indexes start at 0). You can also query for multiple points:
|
| >>> X = [[0., 1., 0.], [1., 0., 1.]]
| >>> neigh.kneighbors(X, return_distance=False)
| array([[1],
| [2]]...)
|
| kneighbors_graph(self, X=None, n_neighbors=None, mode='connectivity')
| Compute the (weighted) graph of k-Neighbors for points in X.
|
| Parameters
| ----------
| X : array-like of shape (n_queries, n_features), or (n_queries, n_indexed) if metric == 'precomputed', default=None
| The query point or points.
| If not provided, neighbors of each indexed point are returned.
| In this case, the query point is not considered its own neighbor.
| For ``metric='precomputed'`` the shape should be
| (n_queries, n_indexed). Otherwise the shape should be
| (n_queries, n_features).
|
| n_neighbors : int, default=None
| Number of neighbors for each sample. The default is the value
| passed to the constructor.
|
| mode : {'connectivity', 'distance'}, default='connectivity'
| Type of returned matrix: 'connectivity' will return the
| connectivity matrix with ones and zeros, in 'distance' the
| edges are distances between points, type of distance
| depends on the selected metric parameter in
| NearestNeighbors class.
|
| Returns
| -------
| A : sparse-matrix of shape (n_queries, n_samples_fit)
| `n_samples_fit` is the number of samples in the fitted data.
| `A[i, j]` gives the weight of the edge connecting `i` to `j`.
| The matrix is of CSR format.
|
| See Also
| --------
| NearestNeighbors.radius_neighbors_graph : Compute the (weighted) graph
| of Neighbors for points in X.
|
| Examples
| --------
| >>> X = [[0], [3], [1]]
| >>> from sklearn.neighbors import NearestNeighbors
| >>> neigh = NearestNeighbors(n_neighbors=2)
| >>> neigh.fit(X)
| NearestNeighbors(n_neighbors=2)
| >>> A = neigh.kneighbors_graph(X)
| >>> A.toarray()
| array([[1., 0., 1.],
| [0., 1., 1.],
| [1., 0., 1.]])
|
| ----------------------------------------------------------------------
| Data descriptors inherited from sklearn.neighbors._base.KNeighborsMixin:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| ----------------------------------------------------------------------
| Methods inherited from sklearn.base.ClassifierMixin:
|
| score(self, X, y, sample_weight=None)
| Return the mean accuracy on the given test data and labels.
|
| In multi-label classification, this is the subset accuracy
| which is a harsh metric since you require for each sample that
| each label set be correctly predicted.
|
| Parameters
| ----------
| X : array-like of shape (n_samples, n_features)
| Test samples.
|
| y : array-like of shape (n_samples,) or (n_samples, n_outputs)
| True labels for `X`.
|
| sample_weight : array-like of shape (n_samples,), default=None
| Sample weights.
|
| Returns
| -------
| score : float
| Mean accuracy of ``self.predict(X)`` wrt. `y`.
|
| ----------------------------------------------------------------------
| Methods inherited from sklearn.base.BaseEstimator:
|
| __getstate__(self)
|
| __repr__(self, N_CHAR_MAX=700)
| Return repr(self).
|
| __setstate__(self, state)
|
| get_params(self, deep=True)
| Get parameters for this estimator.
|
| Parameters
| ----------
| deep : bool, default=True
| If True, will return the parameters for this estimator and
| contained subobjects that are estimators.
|
| Returns
| -------
| params : dict
| Parameter names mapped to their values.
|
| set_params(self, **params)
| Set the parameters of this estimator.
|
| The method works on simple estimators as well as on nested objects
| (such as :class:`~sklearn.pipeline.Pipeline`). The latter have
| parameters of the form ``<component>__<parameter>`` so that it's
| possible to update each component of a nested object.
|
| Parameters
| ----------
| **params : dict
| Estimator parameters.
|
| Returns
| -------
| self : estimator instance
| Estimator instance.
knn = KNeighborsClassifier(n_neighbors=300)
knn.fit(X_train, y_train)
knn_predictions = knn.predict(X_test)
print(classification_report(y_test, knn_predictions))
precision recall f1-score support
F 0.54 0.35 0.42 507
M 0.50 0.68 0.58 484
accuracy 0.51 991
macro avg 0.52 0.52 0.50 991
weighted avg 0.52 0.51 0.50 991
From the two reports above, K-Nearest Neighbors and the Decision Tree perform identically when predicting Male victims, with the same precision of 50%. For Female victims, the Decision Tree is a little higher. With both around 50%, neither is a good result; we may need to include more information or increase the size of our training data.
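One reason KNN may underperform is the arbitrary choice of `n_neighbors=300`; scanning a few values of k and comparing test accuracy is a cheap way to tune it. A sketch on synthetic stand-in data (our `X_train`/`y_train` from the crime dataframe would replace it):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the victim features and a binary label
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=600) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Test accuracy for a handful of candidate k values
scores = {}
for k in (5, 25, 75, 150, 300):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = knn.score(X_te, y_te)

best_k = max(scores, key=scores.get)
```

On the real data, a cross-validated scan (e.g. `GridSearchCV`) would be the more careful version of the same idea.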
📉Linear Regression#
Here we'll create a new dataframe that contains the total number of victims per month together with a month column. Then we'll drop the missing value.
new_data_frame = df_pre[["Month","Vict Age"]]
new_data_frame = new_data_frame.groupby("Month").count()
another_datafram = pd.DataFrame([0,1,2,3,4,5,6,7,8,9,10,11,12])
result = pd.concat([new_data_frame, another_datafram], axis=1, ignore_index=True)
result = result.rename(columns={0: "Total Victims", 1: "Month"})
result
| Total Victims | Month | |
|---|---|---|
| 0 | NaN | 0 |
| 1 | 537.0 | 1 |
| 2 | 534.0 | 2 |
| 3 | 549.0 | 3 |
| 4 | 430.0 | 4 |
| 5 | 529.0 | 5 |
| 6 | 515.0 | 6 |
| 7 | 505.0 | 7 |
| 8 | 355.0 | 8 |
| 9 | 253.0 | 9 |
| 10 | 228.0 | 10 |
| 11 | 266.0 | 11 |
| 12 | 253.0 | 12 |
predict_datafram = result.dropna().copy()
predict_datafram
| Total Victims | Month | |
|---|---|---|
| 1 | 537.0 | 1 |
| 2 | 534.0 | 2 |
| 3 | 549.0 | 3 |
| 4 | 430.0 | 4 |
| 5 | 529.0 | 5 |
| 6 | 515.0 | 6 |
| 7 | 505.0 | 7 |
| 8 | 355.0 | 8 |
| 9 | 253.0 | 9 |
| 10 | 228.0 | 10 |
| 11 | 266.0 | 11 |
| 12 | 253.0 | 12 |
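As an aside, the concat-then-dropna detour above (which produces the NaN row for Month 0) can be avoided: `groupby(...).size()` plus `reset_index` builds the same monthly-count table directly. A sketch with hypothetical toy data standing in for `df_pre`:

```python
import pandas as pd

# Hypothetical stand-in for df_pre's Month column
toy = pd.DataFrame({"Month": [1, 1, 2, 3, 3, 3]})

# One row per month with its victim count, no NaN row to drop
counts = (toy.groupby("Month").size()
             .rename("Total Victims")
             .reset_index())
```

The result has exactly the `Month` / `Total Victims` columns the regression below needs.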
# Instantiate Linear Regression
reg = LinearRegression()
Then we'll fit the model on our data to predict the total number of victims per month.
reg.fit(predict_datafram[["Month"]],predict_datafram["Total Victims"])
LinearRegression()
Let’s see the coefficient and the intercept.
reg.coef_[0]
-32.16783216783216
reg.intercept_
621.9242424242424
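We can sanity-check these numbers by hand: the fitted line is Total Victims ≈ 621.92 − 32.17 × Month, so Month 1 gives roughly 589.76. A small sketch that refits the same twelve monthly totals copied from the table above:

```python
from sklearn.linear_model import LinearRegression

# The twelve monthly totals from the table above
months = [[m] for m in range(1, 13)]
victims = [537, 534, 549, 430, 529, 515, 505, 355, 253, 228, 266, 253]

reg = LinearRegression().fit(months, victims)

# By-hand prediction for Month 1: intercept + slope * 1
manual = reg.intercept_ + reg.coef_[0] * 1
```

`manual` agrees with `reg.predict([[1]])`, matching the first row of the `pred` column shown next.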
Then we'll add a column named pred to our dataframe, which contains the predicted values.
predict_datafram["pred"] = reg.predict(predict_datafram[["Month"]])
predict_datafram
| Total Victims | Month | pred | |
|---|---|---|---|
| 1 | 537.0 | 1 | 589.756410 |
| 2 | 534.0 | 2 | 557.588578 |
| 3 | 549.0 | 3 | 525.420746 |
| 4 | 430.0 | 4 | 493.252914 |
| 5 | 529.0 | 5 | 461.085082 |
| 6 | 515.0 | 6 | 428.917249 |
| 7 | 505.0 | 7 | 396.749417 |
| 8 | 355.0 | 8 | 364.581585 |
| 9 | 253.0 | 9 | 332.413753 |
| 10 | 228.0 | 10 | 300.245921 |
| 11 | 266.0 | 11 | 268.078089 |
| 12 | 253.0 | 12 | 235.910256 |
Plot the prediction graph.
prediction = alt.Chart(predict_datafram).mark_line(color="magenta").encode(
x = "Month",
y = "pred",
tooltip = ["pred","Month"]
).properties(title = "Predict Victims Per Month")
prediction
Let's layer the prediction and the actual chart together to see whether it's a good fit. Before we do that, let's see what our actual data looks like.
actual = alt.Chart(predict_datafram).mark_bar(size=20).encode(
x="Month",
y="Total Victims",
tooltip = ["Total Victims","Month"]
).properties(title = "Actual Victims Per Month")
actual
(actual + prediction).properties(title = "Predict and Actual Victims Per Month")
From the above chart, the fit looks reasonable to me: the actual counts decrease over the months, and the fitted line decreases as well.
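Beyond eyeballing the overlay, `reg.score` gives the R² of the fit. Refitting the twelve monthly totals from the table above, the R² comes out around 0.77, a moderately good linear fit:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same monthly totals as in the table above
X = np.arange(1, 13).reshape(-1, 1)
y = np.array([537, 534, 549, 430, 529, 515, 505, 355, 253, 228, 266, 253])

reg = LinearRegression().fit(X, y)
r2 = reg.score(X, y)  # R^2 (coefficient of determination) on the fitted months
print(round(r2, 2))
```

So the month alone explains roughly three-quarters of the variance in the monthly totals, consistent with the downward trend in the chart.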
📎 K-Means Clustering#
Let us recap what information we have in the df_pre dataset.
df_pre.sample(5)
| DATE OCC | Vict Sex | AREA NAME | Vict Age | Vict Descent | Hour | Month | Year | |
|---|---|---|---|---|---|---|---|---|
| 317200 | 2021-05-29 | M | West Valley | 49 | O | 15 | 5 | 2021 |
| 101959 | 2020-04-23 | M | Northeast | 30 | W | 12 | 4 | 2020 |
| 13438 | 2020-03-24 | M | Topanga | 21 | O | 23 | 3 | 2020 |
| 143107 | 2020-09-03 | F | West LA | 21 | H | 18 | 9 | 2020 |
| 79287 | 2020-05-14 | M | Southwest | 72 | B | 20 | 5 | 2020 |
k_sub = df_pre.copy()
Then we’ll import StandardScaler and instantiate it.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
Import is_numeric_dtype to select the columns with numerical values.
from pandas.api.types import is_numeric_dtype
cluster_column = [c for c in k_sub.columns if is_numeric_dtype(k_sub[c])]
cluster_column
['Vict Age', 'Hour', 'Month', 'Year']
Fit the scaler on the cluster_column features.
scaler.fit(k_sub[cluster_column])
StandardScaler()
k_sub[cluster_column] = scaler.transform(k_sub[cluster_column])
k_sub
| DATE OCC | Vict Sex | AREA NAME | Vict Age | Vict Descent | Hour | Month | Year | |
|---|---|---|---|---|---|---|---|---|
| 297056 | 2021-02-14 | M | Olympic | 0.918022 | H | -0.314625 | -1.103873 | 1.272705 |
| 314052 | 2021-02-02 | M | Devonshire | 0.205239 | B | -0.482024 | -1.103873 | 1.272705 |
| 265084 | 2021-03-09 | F | West LA | -1.349925 | A | 0.857168 | -0.794791 | 1.272705 |
| 128226 | 2020-08-09 | F | Pacific | -0.572343 | W | 0.857168 | 0.750618 | -0.785728 |
| 88874 | 2020-08-21 | F | Mission | 1.112418 | O | -0.482024 | 0.750618 | -0.785728 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 271580 | 2021-01-07 | F | 77th Street | -0.377948 | H | -0.984221 | -1.412954 | 1.272705 |
| 79708 | 2020-04-29 | M | Wilshire | -0.637142 | C | -0.314625 | -0.485709 | -0.785728 |
| 266620 | 2021-07-24 | F | N Hollywood | -0.248351 | W | -1.319019 | 0.441537 | 1.272705 |
| 189307 | 2020-12-26 | M | West Valley | 0.853224 | A | 1.191966 | 1.986946 | -0.785728 |
| 195530 | 2020-06-22 | F | West Valley | -1.544321 | H | 0.522370 | 0.132455 | -0.785728 |
4954 rows × 8 columns
Let us see the mean and standard deviations.
k_sub.mean(axis=0)
Vict Age 3.155418e-17
Hour -9.322826e-17
Month 1.204796e-16
Year 1.040750e-14
dtype: float64
k_sub.std(axis=0)
DATE OCC 174 days 03:08:29.086223064
Vict Age 1.000101
Hour 1.000101
Month 1.000101
Year 1.000101
dtype: object
From the above, we can see the mean is around 0 and the std is around 1, which is what we want.
Next, we'll import KMeans.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 2)
kmeans.fit(k_sub[cluster_column])
KMeans(n_clusters=2)
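The choice of `n_clusters=2` here is somewhat arbitrary; a common heuristic is to compute the KMeans inertia for several k values and look for an elbow where the decrease levels off. A sketch on synthetic stand-in data with the same shape as our four scaled columns:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled columns (Vict Age, Hour, Month, Year)
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 4))

# Within-cluster sum of squares (inertia) for k = 1..6
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(Z).inertia_
            for k in range(1, 7)}
```

Plotting `inertias` against k on the real `k_sub[cluster_column]` would show whether 2 clusters is a reasonable pick.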
Add a new column named cluster to our dataframe.
k_sub["cluster"] = kmeans.predict(k_sub[cluster_column])
k_sub
| DATE OCC | Vict Sex | AREA NAME | Vict Age | Vict Descent | Hour | Month | Year | cluster | |
|---|---|---|---|---|---|---|---|---|---|
| 297056 | 2021-02-14 | M | Olympic | 0.918022 | H | -0.314625 | -1.103873 | 1.272705 | 1 |
| 314052 | 2021-02-02 | M | Devonshire | 0.205239 | B | -0.482024 | -1.103873 | 1.272705 | 1 |
| 265084 | 2021-03-09 | F | West LA | -1.349925 | A | 0.857168 | -0.794791 | 1.272705 | 1 |
| 128226 | 2020-08-09 | F | Pacific | -0.572343 | W | 0.857168 | 0.750618 | -0.785728 | 0 |
| 88874 | 2020-08-21 | F | Mission | 1.112418 | O | -0.482024 | 0.750618 | -0.785728 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 271580 | 2021-01-07 | F | 77th Street | -0.377948 | H | -0.984221 | -1.412954 | 1.272705 | 1 |
| 79708 | 2020-04-29 | M | Wilshire | -0.637142 | C | -0.314625 | -0.485709 | -0.785728 | 0 |
| 266620 | 2021-07-24 | F | N Hollywood | -0.248351 | W | -1.319019 | 0.441537 | 1.272705 | 1 |
| 189307 | 2020-12-26 | M | West Valley | 0.853224 | A | 1.191966 | 1.986946 | -0.785728 | 0 |
| 195530 | 2020-06-22 | F | West Valley | -1.544321 | H | 0.522370 | 0.132455 | -0.785728 | 0 |
4954 rows × 9 columns
We'll put Month on the x-axis and Vict Age on the y-axis, use the KMeans cluster label for the color and shape, and plot the chart.
alt.Chart(k_sub).mark_point(size = 100, filled = True).encode(
x="Month",
y="Vict Age",
color="cluster:N",
shape="cluster:N"
)
Make a list of charts: instead of only Vict Age on the y-axis, we'll iterate over each column in cluster_column.
chart_list = []
for c in cluster_column:
chart = alt.Chart(k_sub).mark_point(size=100,filled=True).encode(
x="Month",
y=c,
color="cluster:N",
shape="cluster:N"
)
chart_list.append(chart)
alt.vconcat(*chart_list)
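Besides charts, a quick numeric way to see what the clusters capture is the per-cluster mean of each scaled feature. A sketch on synthetic stand-in data; on the real data, `k_sub.groupby("cluster")[cluster_column].mean()` would play the same role:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for two of the scaled columns
rng = np.random.default_rng(1)
k_toy = pd.DataFrame(rng.normal(size=(300, 2)), columns=["Vict Age", "Hour"])

# Label each row with its cluster, then summarize each cluster
k_toy["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(k_toy)
profile = k_toy.groupby("cluster")[["Vict Age", "Hour"]].mean()
```

A large gap between the two rows of `profile` on some feature indicates that feature drives the partition.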
😄 Summary#
In my project, I utilize a single dataset containing crime data spanning from 2020 to the present day. My primary objectives are to leverage machine learning techniques for predicting the gender of the victims and to employ a Linear Regression model to forecast the monthly total victim count. Additionally, I aim to discern whether the number of female victims exceeds that of male victims. To achieve these goals, I employ two distinct methods, K-Nearest Neighbors and Decision Trees, with the intention of determining which method yields the highest predictive accuracy. Also, I use K-Means to see how clustering partitions our data.
📝 References#
What is the source of your dataset(s)?
https://www.kaggle.com/datasets/susant4learning/crime-in-los-angeles-data-from-2020-to-present
List any other references that you found helpful.
https://data.lacity.org/Public-Safety/Crime-Data-from-2020-to-Present/2nrs-mtv8
https://medium.com/analytics-vidhya/los-angeles-crime-data-analysis-using-pandas-a68780d80a83
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html